Overview

This is a summary of the single cell analysis performed after demultiplexing with Freemuxlet.

In general, the results are very consistent with what we had previously submitted. The analysis parameters are very similar to what was done before i.e. we didn’t have to force anything to make the results come out the same. This is good.

The bulk of the analysis was performed using Monocle3 functions with occasional modifications as noted.

If you are interested in browsing the code, here is what is in each file:

Data Preprocessing

These are the number of cells identified from the first level of freemuxlet analysis:

## # A tibble: 91 x 3
## # Groups:   doubletCalls [4]
##    doubletCalls   FNL.BEST.GUESS     n
##    <chr>          <chr>          <int>
##  1 Singlet        0,0             9621
##  2 Singlet        1,1             8361
##  3 Singlet        10,10           9615
##  4 Singlet        11,11           1808
##  5 Singlet        2,2            10983
##  6 Singlet        3,3             8497
##  7 Singlet        4,4            13052
##  8 Singlet        5,5             9577
##  9 Singlet        6,6             7720
## 10 Singlet        7,7             8709
## 11 Singlet        8,8            10098
## 12 Singlet        9,9             7322
## 13 Freemuxlet.DBL 1,0              828
## 14 Freemuxlet.DBL 10,0             889
## 15 Freemuxlet.DBL 10,1            1099
## 16 Freemuxlet.DBL 10,2             992
## 17 Freemuxlet.DBL 10,3             875
## 18 Freemuxlet.DBL 10,4            1086
## 19 Freemuxlet.DBL 10,5            1423
## 20 Freemuxlet.DBL 10,6            1211
## # … with 71 more rows

The FNL.BEST.GUESS corresponds to the cluster assignment provided directly by freemuxlet. See “freemuxlet_data_out/fml_metadata_summary1.csv”

After incorporating Ravi’s genotyping information, the singlets were assigned to the patient id’s we are familiar with (see “freemuxlet_data_out/singlet_only_summary.csv”:

## # A tibble: 12 x 3
##    VWsample     n running_total
##    <chr>    <int>         <int>
##  1 E01      18329         18329
##  2 E02       6054         24383
##  3 E03      17680         42063
##  4 E04      14258         56321
##  5 E05       7353         63674
##  6 E06       9216         72890
##  7 E07       2108         74998
##  8 E09       6675         81673
##  9 E10       2428         84101
## 10 E11       9491         93592
## 11 E12       4262         97854
## 12 E13       7400        105254

QC was performed to remove cells with a low number of features or genes, and a high percentage of mitochondrial RNA. Cutoffs were determined using the defaults for the “isOutlier” function in scater which is 3 MADs in the bad direction. Each calculation was performed on individual 10X runs (e.g. “cyan1”).

Distibution of features prior to thresholding:

channels_nfeature_rna_cds5

Feature thresholds applied:

qc_thresholds_features
## # A tibble: 10 x 3
##    channel lower higher
##    <chr>   <dbl>  <dbl>
##  1 cyan1   105.     Inf
##  2 cyan2    93.6    Inf
##  3 blue1   129.     Inf
##  4 blue2   175.     Inf
##  5 green1  224.     Inf
##  6 green2  256.     Inf
##  7 red1    119.     Inf
##  8 red2    117.     Inf
##  9 yellow1 156.     Inf
## 10 yellow2 198.     Inf

Distribution of features after thresholding:

channels_nfeature_rna_cds6

Distribution of mitochondrial RNA percentage prior to thresholding:

channels_pct_counts_mito_cds5

qc_thresholds_mito
## # A tibble: 10 x 3
##    channel lower higher
##    <chr>   <dbl>  <dbl>
##  1 cyan1    -Inf   17.2
##  2 cyan2    -Inf   16.4
##  3 blue1    -Inf   17.2
##  4 blue2    -Inf   16.6
##  5 green1   -Inf   12.3
##  6 green2   -Inf   12.9
##  7 red1     -Inf   12.1
##  8 red2     -Inf   12.6
##  9 yellow1  -Inf   12.6
## 10 yellow2  -Inf   11.3

Distribution of mitochondrial RNA percentage after thresholding:

channels_pct_counts_mito_cds6

These show that the cells were good quality basically across the board. I broke it down by patient and the results were very similar. I can send those plots if you are interested.

The following is a table of cds cell numbers at all steps of preprocessing:

cds_cell_numbers
## # A tibble: 9 x 3
##   cds_vec description_vec                                     nrow_vec
##   <chr>   <chr>                                                  <int>
## 1 cds0    made from combine_cds                                 175667
## 2 cds1    added metadata columns                                175667
## 3 cds2    filtered out cells without freemuxlet data            175403
## 4 cds2.1  filtered for singlets in doubletCalls                 105363
## 5 cds2.2  filtered out ambiguous cells from cyan 1 and cyan 2   105254
## 6 cds3    calculated QC metrics                                 105254
## 7 cds4    added metadata columns                                105254
## 8 cds5    removed metadata columns                              105254
## 9 cds6    removed low quality cells                             103415

So in summary, there were many cells filtered out by freemuxlet and doublet finder. About 1% of cells failed qc.

CSVs with these data can be found in the freemuxlet_data_out and qc_data_out folders in the repository.

Dimension Reduction

Using standard PCA calculations and UMAP dimension reduction we get something very similar to what we had before:

cds_previz

Performing batch correction gives us this:

cds_aligned_previz

Basically the T cells look almost the same, so the major effects are to reduce the variability in the myelod blast cells. I don’t think this is what we want to do so I proceeded with the non-batch-corrected data.

Clustering

As described in the overview, clustering was performed using Leiden and Paritioning methods with default Monocle settings. Top Markers are available in “data_out/partition_markers.csv”. They are almost identical as before. The only thing was that the large cluster of blasts from patient 1 got put in their own cluster this time. This really doesn’t affect our results.

Here are selected top markers. I think something like this would look good for the paper:

marker_gene_dotplot

I think this can take the place of the single UMAP scatter plots for the most part.

Here are the UMAP plots with partition assignments overlaid and colors according to response:

all_cells_by_response

Here are all cells colored by patient:

all_cells_by_pt

Both are basically identical to what we had before.

I think it would be worthwhile to specifically highlight AML blast genes like this:

aml_blast_markers

Vicki: if you send me the number of cells in each phenotypic category: B, T, DC, Blast - identified by cytof, I can put together a correlation matrix.

HGF expression

One important finding is that HGF expression seems to be higher in the NR versus the CR group. I think it can be seen pretty well in this figure:

hgf_faceted

The question came up whether HGF and CSF1R expression might be correlated and it looks like the latter is relatively stronger in the CR group, so I think the answer seems to be no from these data:

csf1r_faceted

Gene modules

I don’t know why but there were many fewer gene modules than before. Fortunately the type1 IFN signal is still in there in module 14. The mapk signal ended up in module 2 with a bunch of terminal neutrophil functions.

Here is the module heatmap (myeloid blasts only):

gene_module_rd

I think it is really clear that module 2 is down in the NR and module 14 is up, especially at d0 and d42-44. So I think this is consistent with our prior story. I don’t think we need the other heatmap.

I did the go term analysis with gorilla and revigo as I mentioned above. I’m not happy with how these are being visualized so we may have to just mention the GO terms or select some and make a bar chart of the q values. The problem is that the terms are very highly redundant and even these tools aren’t able to make enough distinction between the two. When I look at the processess and the genes underlying them, they are very different. Module 14 are viral response genes and module 2 is all about neutrophil activation/degranulation and macrophage killing.

I think a good comparison to make is at d0 and d42-44. At these timepoints we can see what the cells are like before and after a full course of ficlatuzumab. The intermediate timepoints are confounded by chemo/maybe not enough time with the ficlatuzumab etc.

So I wanted to see what the aggregate expression of module 14 genes was at the single cell level. Since the module function tells us that these cells are co-regulated across the whole dataset, it makes sense to me to look at them in this way. Monocle generates this module heatmap but adding up the log-transformed, size factor-corrected counts for each gene in a given module for each cell. Then it scales all of these values across all modules. Then it takes the mean of each module value for an arbitrary group of cells and that is what the color is.

So I just did the first steps manually and generated an aggregate module score for each cell in the myeloid blasts at d0 and d42-44. Here are these data:

mod14_d0_agg_violin

mod14_d42_44_agg_violin

The p-values on these figures are calculated using wilcoxon’s test. Because there may be covariates that explain this difference indepenedent of response to ficlatuzumab, I performed regression analysis using the formula ~response + VWsample. After removing the effect due to the VWsample (patient) variable, there was still a significant difference.

This method helps with the zero-inflation problem you have when looking at single genes in isolation.

Ravi and Arjun previously had mentioned pseudobulk analysis in passing in one of our conversations. The idea as I understand it is that the repeated measures involved in single cell overestimate differences between groups of cells and don’t adequately account for biological variation between patients. My personal opinion is that this is so rigorous as to be unhelpful in a study like this. In fact if this is what you want to do, it would have been better to do actual bulk RNA seq, instead of single cell, but maybe I am way off base. There weren’t any signficantly different genes from module 14 when I did pseudobulk analysis.

I did generate lists of single genes using standard regression techniques which you can find in “data_out/response_patient_model_coefs_d0_sig.csv” and “…d_42-44_sig.csv”. I don’t think these plots are particularly compelling because of gene dropout across so many diverse cells. So maybe we won’t show them.

Survival analysis

Because the reviewers asked for it I went ahead and remade the survival figs. They aren’t perfect but do include the Number at Risk table:

pfs_plot_fixed

os_plot_fixed

hsct_os_plot_fixed

The package I used randomly stamps the P value on there which is annoying. When I make the PDF I should be able to move it around to get of off of the curves.

Conclusion

Looks like we have addressed a lot of the concerns. We should be able to quickly correlate the number of cells in the major cell types between cytof and scRNAseq and call it a day. I don’t think there needs to be much or any more harmonization than that.

Let me know if there are any other analyses you’d like to see. Or any aesthetic changes. I’ll put all of my stuff in PDF format and then Vicki, you or I can composite all of the figures together for optimal readability.

Also if I left anything out or made any glaring errors, please let me know.

Thanks!!

Addendum July 27 2020

Here is the gene module heatmap for all partitions:

gene_module_p_a_r

Here is the gene module heatmap for just the blast cells only at d0 and d42-44

gene_module_rd_0_42_44